{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3ac9adb4",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/llama_2_llama_cpp.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "368686b4-f487-4dd4-aeff-37823976529d",
   "metadata": {},
   "source": [
    "# LlamaCPP \n",
    "\n",
    "In this short notebook, we show how to use the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) library with LlamaIndex.\n",
    "\n",
    "In this notebook, we use the [`Qwen/Qwen2.5-7B-Instruct-GGUF`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF) model, along with the proper prompt formatting. \n",
    "\n",
    "By default, if model_path and model_url are blank, the `LlamaCPP` module will load llama2-chat-13B.\n",
    "\n",
    "## Installation\n",
    "\n",
    "To get the best performance out of `LlamaCPP`, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is [here](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal).\n",
    "\n",
    "Full MACOS instructions are also [here](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).\n",
    "\n",
    "In general:\n",
    "- Use `CuBLAS` if you have CUDA and an NVidia GPU\n",
    "- Use `METAL` if you are running on an M1/M2 MacBook\n",
    "- Use `CLBLAST` if you are running on an AMD/Intel GPU"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "271ff51f",
   "metadata": {},
   "source": [
    "For me, on a MAC, I need to install the `metal` backend.\n",
    "\n",
    "```bash\n",
    "CMAKE_ARGS=\"-DGGML_METAL=on\" pip install llama-cpp-python\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c37fe74d",
   "metadata": {},
   "source": [
    "Then you can install the required llama-index pacakages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aff273be",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-embeddings-huggingface\n",
    "%pip install llama-index-llms-llama-cpp"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7927630-0044-41fb-a8a6-8dc3d2adb608",
   "metadata": {},
   "source": [
    "## Setup LLM\n",
    "\n",
    "The LlamaCPP llm is highly configurable. Depending on the model being used, you'll want to pass in `messages_to_prompt` and `completion_to_prompt` functions to help format the model inputs.\n",
    "\n",
    "For any kwargs that need to be passed in during initialization, set them in `model_kwargs`. A full list of available model kwargs is available in the [LlamaCPP docs](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__init__).\n",
    "\n",
    "For any kwargs that need to be passed in during inference, you can set them in `generate_kwargs`. See the full list of [generate kwargs here](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__call__).\n",
    "\n",
    "In general, the defaults are a great starting point. The example below shows configuration with all defaults."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59b27895",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2640c7a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_url = \"https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fa0ec4f-03ff-4e28-957f-b4b99a0faa20",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "llama_model_load_from_file: using device Metal (Apple M2 Max) - 16584 MiB free\n",
      "llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /Users/loganmarkewich/Library/Caches/llama_index/models/qwen2.5-7b-instruct-q3_k_m.gguf (version GGUF V3 (latest))\n",
      "llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.\n",
      "llama_model_loader: - kv   0:                       general.architecture str              = qwen2\n",
      "llama_model_loader: - kv   1:                               general.type str              = model\n",
      "llama_model_loader: - kv   2:                               general.name str              = qwen2.5-7b-instruct\n",
      "llama_model_loader: - kv   3:                            general.version str              = v0.1\n",
      "llama_model_loader: - kv   4:                           general.finetune str              = qwen2.5-7b-instruct\n",
      "llama_model_loader: - kv   5:                         general.size_label str              = 7.6B\n",
      "llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28\n",
      "llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072\n",
      "llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584\n",
      "llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944\n",
      "llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28\n",
      "llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4\n",
      "llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 1000000.000000\n",
      "llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001\n",
      "llama_model_loader: - kv  14:                          general.file_type u32              = 12\n",
      "llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2\n",
      "llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2\n",
      "llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,152064]  = [\"!\", \"\\\"\", \"#\", \"$\", \"%\", \"&\", \"'\", ...\n",
      "llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...\n",
      "llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = [\"Ġ Ġ\", \"ĠĠ ĠĠ\", \"i n\", \"Ġ t\",...\n",
      "llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645\n",
      "llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643\n",
      "llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643\n",
      "llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false\n",
      "llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\\n    {{- '<|im_start|>...\n",
      "llama_model_loader: - kv  25:               general.quantization_version u32              = 2\n",
      "llama_model_loader: - type  f32:  141 tensors\n",
      "llama_model_loader: - type q3_K:  113 tensors\n",
      "llama_model_loader: - type q4_K:   81 tensors\n",
      "llama_model_loader: - type q5_K:    3 tensors\n",
      "llama_model_loader: - type q6_K:    1 tensors\n",
      "llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151649 '<|box_end|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151648 '<|box_start|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151646 '<|object_ref_start|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151644 '<|im_start|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151647 '<|object_ref_end|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG\n",
      "llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG\n",
      "llm_load_vocab: special tokens cache size = 22\n",
      "llm_load_vocab: token to piece cache size = 0.9310 MB\n",
      "llm_load_print_meta: format           = GGUF V3 (latest)\n",
      "llm_load_print_meta: arch             = qwen2\n",
      "llm_load_print_meta: vocab type       = BPE\n",
      "llm_load_print_meta: n_vocab          = 152064\n",
      "llm_load_print_meta: n_merges         = 151387\n",
      "llm_load_print_meta: vocab_only       = 0\n",
      "llm_load_print_meta: n_ctx_train      = 131072\n",
      "llm_load_print_meta: n_embd           = 3584\n",
      "llm_load_print_meta: n_layer          = 28\n",
      "llm_load_print_meta: n_head           = 28\n",
      "llm_load_print_meta: n_head_kv        = 4\n",
      "llm_load_print_meta: n_rot            = 128\n",
      "llm_load_print_meta: n_swa            = 0\n",
      "llm_load_print_meta: n_embd_head_k    = 128\n",
      "llm_load_print_meta: n_embd_head_v    = 128\n",
      "llm_load_print_meta: n_gqa            = 7\n",
      "llm_load_print_meta: n_embd_k_gqa     = 512\n",
      "llm_load_print_meta: n_embd_v_gqa     = 512\n",
      "llm_load_print_meta: f_norm_eps       = 0.0e+00\n",
      "llm_load_print_meta: f_norm_rms_eps   = 1.0e-06\n",
      "llm_load_print_meta: f_clamp_kqv      = 0.0e+00\n",
      "llm_load_print_meta: f_max_alibi_bias = 0.0e+00\n",
      "llm_load_print_meta: f_logit_scale    = 0.0e+00\n",
      "llm_load_print_meta: n_ff             = 18944\n",
      "llm_load_print_meta: n_expert         = 0\n",
      "llm_load_print_meta: n_expert_used    = 0\n",
      "llm_load_print_meta: causal attn      = 1\n",
      "llm_load_print_meta: pooling type     = 0\n",
      "llm_load_print_meta: rope type        = 2\n",
      "llm_load_print_meta: rope scaling     = linear\n",
      "llm_load_print_meta: freq_base_train  = 1000000.0\n",
      "llm_load_print_meta: freq_scale_train = 1\n",
      "llm_load_print_meta: n_ctx_orig_yarn  = 131072\n",
      "llm_load_print_meta: rope_finetuned   = unknown\n",
      "llm_load_print_meta: ssm_d_conv       = 0\n",
      "llm_load_print_meta: ssm_d_inner      = 0\n",
      "llm_load_print_meta: ssm_d_state      = 0\n",
      "llm_load_print_meta: ssm_dt_rank      = 0\n",
      "llm_load_print_meta: ssm_dt_b_c_rms   = 0\n",
      "llm_load_print_meta: model type       = 7B\n",
      "llm_load_print_meta: model ftype      = Q3_K - Medium\n",
      "llm_load_print_meta: model params     = 7.62 B\n",
      "llm_load_print_meta: model size       = 3.54 GiB (3.99 BPW) \n",
      "llm_load_print_meta: general.name     = qwen2.5-7b-instruct\n",
      "llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'\n",
      "llm_load_print_meta: EOS token        = 151645 '<|im_end|>'\n",
      "llm_load_print_meta: EOT token        = 151645 '<|im_end|>'\n",
      "llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'\n",
      "llm_load_print_meta: LF token         = 148848 'ÄĬ'\n",
      "llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'\n",
      "llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'\n",
      "llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'\n",
      "llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'\n",
      "llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'\n",
      "llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'\n",
      "llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'\n",
      "llm_load_print_meta: EOG token        = 151645 '<|im_end|>'\n",
      "llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'\n",
      "llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'\n",
      "llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'\n",
      "llm_load_print_meta: max token length = 256\n",
      "llm_load_tensors: tensor 'token_embd.weight' (q3_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead\n",
      "ggml_backend_metal_log_allocated_size: allocated buffer, size =  3402.97 MiB, ( 8663.83 / 21845.34)\n",
      "llm_load_tensors: offloading 28 repeating layers to GPU\n",
      "llm_load_tensors: offloading output layer to GPU\n",
      "llm_load_tensors: offloaded 29/29 layers to GPU\n",
      "llm_load_tensors: Metal_Mapped model buffer size =  3402.96 MiB\n",
      "llm_load_tensors:   CPU_Mapped model buffer size =   223.33 MiB\n",
      "...................................................................................\n",
      "llama_new_context_with_model: n_seq_max     = 1\n",
      "llama_new_context_with_model: n_ctx         = 16384\n",
      "llama_new_context_with_model: n_ctx_per_seq = 16384\n",
      "llama_new_context_with_model: n_batch       = 512\n",
      "llama_new_context_with_model: n_ubatch      = 512\n",
      "llama_new_context_with_model: flash_attn    = 0\n",
      "llama_new_context_with_model: freq_base     = 1000000.0\n",
      "llama_new_context_with_model: freq_scale    = 1\n",
      "llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized\n",
      "ggml_metal_init: allocating\n",
      "ggml_metal_init: found device: Apple M2 Max\n",
      "ggml_metal_init: picking default device: Apple M2 Max\n",
      "ggml_metal_init: using embedded metal library\n",
      "ggml_metal_init: GPU name:   Apple M2 Max\n",
      "ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)\n",
      "ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)\n",
      "ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)\n",
      "ggml_metal_init: simdgroup reduction   = true\n",
      "ggml_metal_init: simdgroup matrix mul. = true\n",
      "ggml_metal_init: has bfloat            = true\n",
      "ggml_metal_init: use bfloat            = false\n",
      "ggml_metal_init: hasUnifiedMemory      = true\n",
      "ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB\n",
      "ggml_metal_init: loaded kernel_add                                    0x4b795baa0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_add_row                                0x3443f6190 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_sub                                    0x4b788d650 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_sub_row                                0x4b787c770 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul                                    0x341732ec0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_row                                0x33f8a6f70 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_div                                    0x4b78710d0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_div_row                                0x4b7893f30 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_repeat_f32                             0x33e9d3420 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_repeat_f16                             0x33f8a7630 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_repeat_i32                             0x33f8a6170 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_repeat_i16                             0x3417904b0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_scale                                  0x4b795cdc0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_scale_4                                0x3417c1b60 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_clamp                                  0x33f8a8690 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_tanh                                   0x4b795c0a0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_relu                                   0x33f8a8fd0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_sigmoid                                0x4b795e020 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_gelu                                   0x3417f12d0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_gelu_4                                 0x4b787e370 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_gelu_quick                             0x34177c660 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_gelu_quick_4                           0x33f8a98f0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_silu                                   0x4b795e950 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_silu_4                                 0x3417e66c0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_elu                                    0x4b795f280 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_soft_max_f16                           0x4b795f9a0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_soft_max_f16_4                         0x4b7960070 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_soft_max_f32                           0x4b7960750 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_soft_max_f32_4                         0x4b7960de0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_diag_mask_inf                          0x33f8a68a0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_diag_mask_inf_8                        0x4b7961350 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_f32                           0x33f8a9ed0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_f16                           0x3443f5b40 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)\n",
      "ggml_metal_init: loaded kernel_get_rows_q4_0                          0x3443f5ef0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_q4_1                          0x33f8aaf60 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_q5_0                          0x4b78728b0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_q5_1                          0x34174bd90 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_q8_0                          0x4b787be70 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_q2_K                          0x341714490 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_q3_K                          0x4b788ba40 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_q4_K                          0x4b78921a0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_q5_K                          0x33f8ab590 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_q6_K                          0x33f8abc10 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_iq2_xxs                       0x33f8ac2d0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_iq2_xs                        0x33f8aca30 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_iq3_xxs                       0x4b7873110 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_iq3_s                         0x4b7961ae0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_iq2_s                         0x33f8ad0e0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_iq1_s                         0x33f8ad7c0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_iq1_m                         0x341764d70 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_iq4_nl                        0x33f8ade30 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_iq4_xs                        0x4b7962160 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_get_rows_i32                           0x4b79627e0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_rms_norm                               0x4b88bf220 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_group_norm                             0x33e9d2fc0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_norm                                   0x33f8ae510 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_ssm_conv_f32                           0x4b7897c60 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_ssm_scan_f32                           0x4b78979c0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_f32_f32                         0x4b7962e90 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)\n",
      "ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)\n",
      "ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)\n",
      "ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)\n",
      "ggml_metal_init: loaded kernel_mul_mv_f16_f32                         0x4b789aa80 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_f16_f32_1row                    0x33f8aeca0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_f16_f32_l4                      0x4b8b61390 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_f16_f16                         0x4b8b615f0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_q4_0_f32                        0x4b8b619f0 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_q4_1_f32                        0x33f8af450 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_q5_0_f32                        0x4b7885df0 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_q5_1_f32                        0x4b88be030 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_q8_0_f32                        0x33f8afb30 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_f16_f32_r1_2                0x33f8b01d0 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_f16_f32_r1_3                0x4b7888520 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_f16_f32_r1_4                0x4b789ada0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_f16_f32_r1_5                0x3417bd850 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_0_f32_r1_2               0x33f8b0980 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_0_f32_r1_3               0x4b7963610 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_0_f32_r1_4               0x4b7963d90 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_0_f32_r1_5               0x4b7964520 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_1_f32_r1_2               0x341755a80 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_1_f32_r1_3               0x4b788a5e0 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_1_f32_r1_4               0x3417629e0 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_1_f32_r1_5               0x4b7876720 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_0_f32_r1_2               0x33f8b10c0 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_0_f32_r1_3               0x33f8b1850 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_0_f32_r1_4               0x4b7870220 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_0_f32_r1_5               0x4b7828e60 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_1_f32_r1_2               0x33f8b1fa0 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_1_f32_r1_3               0x4b7870b70 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_1_f32_r1_4               0x3417dd2e0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_1_f32_r1_5               0x34170cf50 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q8_0_f32_r1_2               0x4b781f450 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q8_0_f32_r1_3               0x33e9d3ed0 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q8_0_f32_r1_4               0x33e9d4430 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q8_0_f32_r1_5               0x4b7964c30 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_K_f32_r1_2               0x4b7888a80 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_K_f32_r1_3               0x4b7965340 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_K_f32_r1_4               0x4b7965ae0 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q4_K_f32_r1_5               0x4b7878ec0 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_K_f32_r1_2               0x4b79661f0 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_K_f32_r1_3               0x33f8b2630 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_K_f32_r1_4               0x4b7874000 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q5_K_f32_r1_5               0x4b7879c30 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q6_K_f32_r1_2               0x4b7966830 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q6_K_f32_r1_3               0x4b7885850 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q6_K_f32_r1_4               0x4b787f700 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_q6_K_f32_r1_5               0x4b787e940 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_iq4_nl_f32_r1_2             0x33f8b2e20 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_iq4_nl_f32_r1_3             0x33f8b3490 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_iq4_nl_f32_r1_4             0x33f8b3ca0 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_ext_iq4_nl_f32_r1_5             0x33f8b4310 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_q2_K_f32                        0x33f8b4b20 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_q3_K_f32                        0x4b78950e0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_q4_K_f32                        0x4b7966f30 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_q5_K_f32                        0x33f8b51a0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_q6_K_f32                        0x4b7967620 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_iq2_xxs_f32                     0x4b7850d90 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_iq2_xs_f32                      0x4b7967cf0 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_iq3_xxs_f32                     0x33f8b5840 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_iq3_s_f32                       0x33f8b5f20 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_iq2_s_f32                       0x4b7968450 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_iq1_s_f32                       0x3417f4580 | th_max =  448 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_iq1_m_f32                       0x33f8b6640 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_iq4_nl_f32                      0x34179f610 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_iq4_xs_f32                      0x4b7968af0 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_f32_f32                      0x3417a6d50 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_f16_f32                      0x4b79692c0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_q4_0_f32                     0x34179e960 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_q4_1_f32                     0x33e9d4990 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_q5_0_f32                     0x3417a3400 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_q5_1_f32                     0x34172a1c0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_q8_0_f32                     0x33e9d4ea0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_q2_K_f32                     0x33e9d53b0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_q3_K_f32                     0x4b7969990 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_q4_K_f32                     0x33f8b6dd0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_q5_K_f32                     0x33f8b74a0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_q6_K_f32                     0x3417238e0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_iq2_xxs_f32                  0x33f8b7bf0 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_iq2_xs_f32                   0x34177e750 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_iq3_xxs_f32                  0x4b88be740 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_iq3_s_f32                    0x4b88c0550 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_iq2_s_f32                    0x4b78879e0 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_iq1_s_f32                    0x4b78752a0 | th_max =  448 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_iq1_m_f32                    0x4b796a0a0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_iq4_nl_f32                   0x33f8b83a0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mv_id_iq4_xs_f32                   0x33f8b8b10 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_f32_f32                         0x33f8b92a0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_f16_f32                         0x34172c720 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)\n",
      "ggml_metal_init: loaded kernel_mul_mm_q4_0_f32                        0x4b796a8b0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_q4_1_f32                        0x33f8b9960 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_q5_0_f32                        0x33e9d58c0 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_q5_1_f32                        0x33e9d5de0 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_q8_0_f32                        0x3443f5480 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_q2_K_f32                        0x341771f90 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_q3_K_f32                        0x4b796afa0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_q4_K_f32                        0x4b796bb20 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_q5_K_f32                        0x3417557a0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_q6_K_f32                        0x4b789e1b0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_iq2_xxs_f32                     0x4b78855a0 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_iq2_xs_f32                      0x4b88c0c90 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_iq3_xxs_f32                     0x4b8b621e0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_iq3_s_f32                       0x33f8ba020 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_iq2_s_f32                       0x34170b0b0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_iq1_s_f32                       0x4b88c1460 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_iq1_m_f32                       0x4b8b62560 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_iq4_nl_f32                      0x4b8b628e0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_iq4_xs_f32                      0x3443f3830 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_f32_f32                      0x4b796b540 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_f16_f32                      0x33f8ba950 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_q4_0_f32                     0x341741360 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_q4_1_f32                     0x4b7887390 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_q5_0_f32                     0x4b78a1070 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_q5_1_f32                     0x3417282d0 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_q8_0_f32                     0x4b781bc50 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_q2_K_f32                     0x4b7881520 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_q3_K_f32                     0x4b7880f50 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_q4_K_f32                     0x33e9d5b20 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_q5_K_f32                     0x4b7877720 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_q6_K_f32                     0x33f8bb150 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_iq2_xxs_f32                  0x33f8bb810 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_iq2_xs_f32                   0x33f8bc020 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_iq3_xxs_f32                  0x3417aac10 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_iq3_s_f32                    0x4b796c280 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_iq2_s_f32                    0x4b796ca30 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_iq1_s_f32                    0x3417714e0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_iq1_m_f32                    0x4b796d1a0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_iq4_nl_f32                   0x3417772f0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_mul_mm_id_iq4_xs_f32                   0x3417a9290 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_rope_norm_f32                          0x3417242b0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_rope_norm_f16                          0x341756f20 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_rope_neox_f32                          0x4b8b62e40 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_rope_neox_f16                          0x3443f2870 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_im2col_f16                             0x33e9d62f0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_im2col_f32                             0x33f8bc6a0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_im2col_ext_f16                         0x4b8b63210 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_im2col_ext_f32                         0x33f8bced0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_conv_transpose_1d_f32_f32              0x4b8b63590 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_conv_transpose_1d_f16_f32              0x4b8b63910 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_upscale_f32                            0x4b796d8a0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_pad_f32                                0x33f8bdb40 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_pad_reflect_1d_f32                     0x33f8bdda0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_timestep_embedding_f32                 0x33f8be950 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_arange_f32                             0x33e9d6e00 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_argsort_f32_i32_asc                    0x4b796e150 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_argsort_f32_i32_desc                   0x3443f4040 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_leaky_relu_f32                         0x4b796eb70 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_f16_h64                 0x4b796f260 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_f16_h80                 0x33f8bf140 | th_max =  640 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_f16_h96                 0x341764ae0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_f16_h112                0x3417941b0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_f16_h128                0x4b8b63fb0 | th_max =  512 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_f16_h256                0x33f8bf880 | th_max =  512 | th_width =   32\n",
      "ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)\n",
      "ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)\n",
      "ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)\n",
      "ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)\n",
      "ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)\n",
      "ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h64                0x341716c00 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h80                0x3417560a0 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h96                0x33e9d71a0 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h112               0x33f8c0010 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h128               0x3443f19f0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_0_h256               0x4b796f9a0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h64                0x33f8c0750 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h80                0x33e9d76b0 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h96                0x33e9d8110 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h112               0x4b79700b0 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h128               0x4b7970830 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q4_1_h256               0x4b88c1f40 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h64                0x4b7970fa0 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h80                0x4b8b64210 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h96                0x3443f1350 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h112               0x4b88c21a0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h128               0x33f8c09b0 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_0_h256               0x33e9d8610 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h64                0x33f8c1020 | th_max =  576 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h80                0x4b7971750 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h96                0x4b8b64470 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h112               0x33f8c17e0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h128               0x33e9d8b10 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q5_1_h256               0x33e9d9010 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h64                0x341724fb0 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h80                0x33f8c1fb0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h96                0x4b787c490 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h112               0x4b789a700 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h128               0x4b7882c00 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_q8_0_h256               0x4b78984a0 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_h128            0x34176b540 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_h128           0x34170a450 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_h128           0x34177d410 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_h128           0x4b78909f0 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_h128           0x34170aa50 | th_max =  768 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_h128           0x33f8c2710 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_f16_h256            0x4b786dcb0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_0_h256           0x4b8b649d0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_q4_1_h256           0x34174a860 | th_max =  896 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_0_h256           0x3417d2450 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_q5_1_h256           0x4b7971fb0 | th_max =  704 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_flash_attn_ext_vec_q8_0_h256           0x33f8c2db0 | th_max =  832 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_set_f32                                0x34173e0c0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_set_i32                                0x4b79727a0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_cpy_f32_f32                            0x341791640 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_cpy_f32_f16                            0x4b8b64d50 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)\n",
      "ggml_metal_init: loaded kernel_cpy_f16_f32                            0x33f8c3510 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_cpy_f16_f16                            0x4b789b450 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)\n",
      "ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)\n",
      "ggml_metal_init: loaded kernel_cpy_f32_q8_0                           0x4b7881260 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_cpy_f32_q4_0                           0x4b7876490 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_cpy_f32_q4_1                           0x4b8b65120 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_cpy_f32_q5_0                           0x33e9d7bc0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_cpy_f32_q5_1                           0x4b7891e90 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_cpy_f32_iq4_nl                         0x3417daae0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_concat                                 0x4b7972e20 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_sqr                                    0x33f8c3d80 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_sqrt                                   0x33f8c46c0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_sin                                    0x4b7894de0 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_cos                                    0x4b7872b50 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_sum_rows                               0x4b785db10 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_argmax                                 0x4b781e420 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_pool_2d_avg_f32                        0x4b789db70 | th_max = 1024 | th_width =   32\n",
      "ggml_metal_init: loaded kernel_pool_2d_max_f32                        0x3443f0b60 | th_max = 1024 | th_width =   32\n",
      "llama_kv_cache_init: kv_size = 16384, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1\n",
      "llama_kv_cache_init: layer 0: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 1: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 2: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 3: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 4: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 5: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 6: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 7: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 8: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 9: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 10: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 11: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 12: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 13: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 14: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 15: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 16: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 17: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 18: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 19: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 20: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 21: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 22: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 23: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 24: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 25: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 26: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init: layer 27: n_embd_k_gqa = 512, n_embd_v_gqa = 512\n",
      "llama_kv_cache_init:      Metal KV buffer size =   896.00 MiB\n",
      "llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB\n",
      "llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB\n",
      "llama_new_context_with_model:      Metal compute buffer size =   956.00 MiB\n",
      "llama_new_context_with_model:        CPU compute buffer size =    39.01 MiB\n",
      "llama_new_context_with_model: graph nodes  = 986\n",
      "llama_new_context_with_model: graph splits = 2\n",
      "Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | MATMUL_INT8 = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 | \n",
      "Model metadata: {'general.quantization_version': '2', 'tokenizer.chat_template': '{%- if tools %}\\n    {{- \\'<|im_start|>system\\\\n\\' }}\\n    {%- if messages[0][\\'role\\'] == \\'system\\' %}\\n        {{- messages[0][\\'content\\'] }}\\n    {%- else %}\\n        {{- \\'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\\' }}\\n    {%- endif %}\\n    {{- \"\\\\n\\\\n# Tools\\\\n\\\\nYou may call one or more functions to assist with the user query.\\\\n\\\\nYou are provided with function signatures within <tools></tools> XML tags:\\\\n<tools>\" }}\\n    {%- for tool in tools %}\\n        {{- \"\\\\n\" }}\\n        {{- tool | tojson }}\\n    {%- endfor %}\\n    {{- \"\\\\n</tools>\\\\n\\\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\\\n<tool_call>\\\\n{{\\\\\"name\\\\\": <function-name>, \\\\\"arguments\\\\\": <args-json-object>}}\\\\n</tool_call><|im_end|>\\\\n\" }}\\n{%- else %}\\n    {%- if messages[0][\\'role\\'] == \\'system\\' %}\\n        {{- \\'<|im_start|>system\\\\n\\' + messages[0][\\'content\\'] + \\'<|im_end|>\\\\n\\' }}\\n    {%- else %}\\n        {{- \\'<|im_start|>system\\\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\\\n\\' }}\\n    {%- endif %}\\n{%- endif %}\\n{%- for message in messages %}\\n    {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\\n        {{- \\'<|im_start|>\\' + message.role + \\'\\\\n\\' + message.content + \\'<|im_end|>\\' + \\'\\\\n\\' }}\\n    {%- elif message.role == \"assistant\" %}\\n        {{- \\'<|im_start|>\\' + message.role }}\\n        {%- if message.content %}\\n            {{- \\'\\\\n\\' + message.content }}\\n        {%- endif %}\\n        {%- for tool_call in message.tool_calls %}\\n            {%- if tool_call.function is defined %}\\n                {%- set tool_call = tool_call.function %}\\n            {%- endif %}\\n            {{- \\'\\\\n<tool_call>\\\\n{\"name\": \"\\' }}\\n            {{- tool_call.name }}\\n            {{- \\'\", \"arguments\": \\' }}\\n            {{- tool_call.arguments | tojson }}\\n            {{- \\'}\\\\n</tool_call>\\' }}\\n        {%- endfor %}\\n        {{- \\'<|im_end|>\\\\n\\' }}\\n    {%- elif message.role == \"tool\" %}\\n        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\\n            {{- \\'<|im_start|>user\\' }}\\n        {%- endif %}\\n        {{- \\'\\\\n<tool_response>\\\\n\\' }}\\n        {{- message.content }}\\n        {{- \\'\\\\n</tool_response>\\' }}\\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\\n            {{- \\'<|im_end|>\\\\n\\' }}\\n        {%- endif %}\\n    {%- endif %}\\n{%- endfor %}\\n{%- if add_generation_prompt %}\\n    {{- \\'<|im_start|>assistant\\\\n\\' }}\\n{%- endif %}\\n', 'tokenizer.ggml.bos_token_id': '151643', 'tokenizer.ggml.padding_token_id': '151643', 'tokenizer.ggml.eos_token_id': '151645', 'tokenizer.ggml.pre': 'qwen2', 'tokenizer.ggml.model': 'gpt2', 'qwen2.attention.layer_norm_rms_epsilon': '0.000001', 'qwen2.rope.freq_base': '1000000.000000', 'qwen2.attention.head_count_kv': '4', 'qwen2.embedding_length': '3584', 'qwen2.context_length': '131072', 'general.type': 'model', 'qwen2.attention.head_count': '28', 'qwen2.feed_forward_length': '18944', 'general.architecture': 'qwen2', 'qwen2.block_count': '28', 'general.file_type': '12', 'general.size_label': '7.6B', 'tokenizer.ggml.add_bos_token': 'false', 'general.version': 'v0.1', 'general.finetune': 'qwen2.5-7b-instruct', 'general.name': 'qwen2.5-7b-instruct'}\n",
      "Available chat formats from metadata: chat_template.default\n",
      "Using gguf chat template: {%- if tools %}\n",
      "    {{- '<|im_start|>system\\n' }}\n",
      "    {%- if messages[0]['role'] == 'system' %}\n",
      "        {{- messages[0]['content'] }}\n",
      "    {%- else %}\n",
      "        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n",
      "    {%- endif %}\n",
      "    {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n",
      "    {%- for tool in tools %}\n",
      "        {{- \"\\n\" }}\n",
      "        {{- tool | tojson }}\n",
      "    {%- endfor %}\n",
      "    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}}\\n</tool_call><|im_end|>\\n\" }}\n",
      "{%- else %}\n",
      "    {%- if messages[0]['role'] == 'system' %}\n",
      "        {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n",
      "    {%- else %}\n",
      "        {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n",
      "    {%- endif %}\n",
      "{%- endif %}\n",
      "{%- for message in messages %}\n",
      "    {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n",
      "        {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n",
      "    {%- elif message.role == \"assistant\" %}\n",
      "        {{- '<|im_start|>' + message.role }}\n",
      "        {%- if message.content %}\n",
      "            {{- '\\n' + message.content }}\n",
      "        {%- endif %}\n",
      "        {%- for tool_call in message.tool_calls %}\n",
      "            {%- if tool_call.function is defined %}\n",
      "                {%- set tool_call = tool_call.function %}\n",
      "            {%- endif %}\n",
      "            {{- '\\n<tool_call>\\n{\"name\": \"' }}\n",
      "            {{- tool_call.name }}\n",
      "            {{- '\", \"arguments\": ' }}\n",
      "            {{- tool_call.arguments | tojson }}\n",
      "            {{- '}\\n</tool_call>' }}\n",
      "        {%- endfor %}\n",
      "        {{- '<|im_end|>\\n' }}\n",
      "    {%- elif message.role == \"tool\" %}\n",
      "        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n",
      "            {{- '<|im_start|>user' }}\n",
      "        {%- endif %}\n",
      "        {{- '\\n<tool_response>\\n' }}\n",
      "        {{- message.content }}\n",
      "        {{- '\\n</tool_response>' }}\n",
      "        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n",
      "            {{- '<|im_end|>\\n' }}\n",
      "        {%- endif %}\n",
      "    {%- endif %}\n",
      "{%- endfor %}\n",
      "{%- if add_generation_prompt %}\n",
      "    {{- '<|im_start|>assistant\\n' }}\n",
      "{%- endif %}\n",
      "\n",
      "Using chat eos_token: <|im_end|>\n",
      "Using chat bos_token: <|endoftext|>\n",
      "ggml_metal_free: deallocating\n"
     ]
    }
   ],
   "source": [
    "from llama_index.llms.llama_cpp import LlamaCPP\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen2.5-7B-Instruct\")\n",
    "\n",
    "\n",
    "def messages_to_prompt(messages):\n",
    "    messages = [{\"role\": m.role.value, \"content\": m.content} for m in messages]\n",
    "    prompt = tokenizer.apply_chat_template(\n",
    "        messages, tokenize=False, add_generation_prompt=True\n",
    "    )\n",
    "    return prompt\n",
    "\n",
    "\n",
    "def completion_to_prompt(completion):\n",
    "    messages = [{\"role\": \"user\", \"content\": completion}]\n",
    "    prompt = tokenizer.apply_chat_template(\n",
    "        messages, tokenize=False, add_generation_prompt=True\n",
    "    )\n",
    "    return prompt\n",
    "\n",
    "\n",
    "llm = LlamaCPP(\n",
    "    # You can pass in the URL to a GGML model to download it automatically\n",
    "    model_url=model_url,\n",
    "    # optionally, you can set the path to a pre-downloaded model instead of model_url\n",
    "    model_path=None,\n",
    "    temperature=0.1,\n",
    "    max_new_tokens=256,\n",
    "    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room\n",
    "    context_window=16384,\n",
    "    # kwargs to pass to __call__()\n",
    "    generate_kwargs={},\n",
    "    # kwargs to pass to __init__()\n",
    "    # set to at least 1 to use GPU\n",
    "    model_kwargs={\"n_gpu_layers\": -1},\n",
    "    # transform inputs into Llama2 format\n",
    "    messages_to_prompt=messages_to_prompt,\n",
    "    completion_to_prompt=completion_to_prompt,\n",
    "    verbose=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "445453b1",
   "metadata": {},
   "source": [
    "We can tell that the model is using `metal` and our GPU due to the logging!\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d312cf8",
   "metadata": {},
   "source": [
    "```\n",
    "offloaded 29/29 layers to GPU\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e2e6a78-7e5d-4915-bcbf-6087edb30276",
   "metadata": {},
   "source": [
    "## Start using our `LlamaCPP` LLM abstraction!\n",
    "\n",
    "We can simply use the `complete` method of our `LlamaCPP` LLM abstraction to generate completions given a prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cfaf34c-0348-415e-98bb-83f782d64fe9",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "llama_perf_context_print:        load time =     699.52 ms\n",
      "llama_perf_context_print: prompt eval time =       0.00 ms /    42 tokens (    0.00 ms per token,      inf tokens per second)\n",
      "llama_perf_context_print:        eval time =       0.00 ms /   170 runs   (    0.00 ms per token,      inf tokens per second)\n",
      "llama_perf_context_print:       total time =    4230.39 ms /   212 tokens\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Certainly! Here's a short poem about cats and dogs, blending their characteristics and interactions:\n",
      "\n",
      "In the quiet of the evening,\n",
      "Where shadows play and light,\n",
      "Two creatures, one in each corner,\n",
      "Of the world, they find their might.\n",
      "\n",
      "The cat, with grace and stealth,\n",
      "In the dark, she pounces,\n",
      "While the dog, with loyal teeth,\n",
      "Protects and never fumbles.\n",
      "\n",
      "One with fur as soft as silk,\n",
      "The other with fur like flax,\n",
      "Yet in their hearts, a bond is built,\n",
      "A friendship that will never lack.\n",
      "\n",
      "Though their paths may seem apart,\n",
      "In their hearts, they find a way,\n",
      "To coexist, in peace and heart,\n",
      "A tale of two, in harmony's sway.\n",
      "\n",
      "This poem captures the essence of both cats and dogs, highlighting their unique traits and the potential for friendship between them.\n"
     ]
    }
   ],
   "source": [
    "response = llm.complete(\"Hello! Can you tell me a poem about cats and dogs?\")\n",
    "print(response.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9038f7d7",
   "metadata": {},
   "source": [
    "We can use the `stream_complete` endpoint to stream the response as it’s being generated rather than waiting for the entire response to be generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b059409-cd9d-4651-979c-03b3943e94af",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sure, here's a"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Llama.generate: 24 prefix-match hit, remaining 15 prompt tokens to eval\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " poem about fast cars:\n",
      "\n",
      "In the glow of twilight's soft embrace,\n",
      "Rumble of engines, a deep, steady thrum,\n",
      "Metal beasts, sleek and poised to race,\n",
      "Through the night, their shadows run.\n",
      "\n",
      "Wind whispers secrets as it streams,\n",
      "Past the curves, the corners, the bends,\n",
      "Adrenaline pulses, hearts in dreams,\n",
      "As they chase the horizon, the open ends.\n",
      "\n",
      "Neons flicker, red and blue,\n",
      "In the darkness, a vibrant hue,\n",
      "Each car a story, a tale to chew,\n",
      "On the thrill of speed, the rush of new.\n",
      "\n",
      "Steel and glass, a silent crew,\n",
      "In the silence, a deafening sound,\n",
      "Of the road, of the wind, of the crowd,\n",
      "In the moment, time seems to be bound.\n",
      "\n",
      "But the night is long, the journey vast,\n",
      "And the dawn will come, the race will pass,\n",
      "Yet in memory, the thrill will last,\n",
      "A fleeting dream, a fast car's class."
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "llama_perf_context_print:        load time =     699.52 ms\n",
      "llama_perf_context_print: prompt eval time =       0.00 ms /    15 tokens (    0.00 ms per token,      inf tokens per second)\n",
      "llama_perf_context_print:        eval time =       0.00 ms /   201 runs   (    0.00 ms per token,      inf tokens per second)\n",
      "llama_perf_context_print:       total time =    4365.20 ms /   216 tokens\n"
     ]
    }
   ],
   "source": [
    "response_iter = llm.stream_complete(\"Can you write me a poem about fast cars?\")\n",
    "for response in response_iter:\n",
    "    print(response.delta, end=\"\", flush=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7617600",
   "metadata": {},
   "source": [
    "## Query engine set up with LlamaCPP\n",
    "\n",
    "We can simply pass in the `LlamaCPP` LLM abstraction to the `LlamaIndex` query engine as usual.\n",
    "\n",
    "But first, let's change the global tokenizer to match our LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8ff0c0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import set_global_tokenizer\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "set_global_tokenizer(\n",
    "    AutoTokenizer.from_pretrained(\"Qwen/Qwen2.5-7B-Instruct\").encode\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4c6f564",
   "metadata": {},
   "outputs": [],
   "source": [
    "# use Huggingface embeddings\n",
    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
    "\n",
    "embed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d485f1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "# load documents\n",
    "documents = SimpleDirectoryReader(\"../data/paul_graham/\").load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c55c33cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex\n",
    "\n",
    "# create vector store index\n",
    "index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e07659c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set up query engine\n",
    "query_engine = index.as_query_engine(llm=llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64e095c5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Llama.generate: 24 prefix-match hit, remaining 1983 prompt tokens to eval\n",
      "llama_perf_context_print:        load time =     699.52 ms\n",
      "llama_perf_context_print: prompt eval time =       0.00 ms /  1983 tokens (    0.00 ms per token,      inf tokens per second)\n",
      "llama_perf_context_print:        eval time =       0.00 ms /   172 runs   (    0.00 ms per token,      inf tokens per second)\n",
      "llama_perf_context_print:       total time =    8297.83 ms /  2155 tokens\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Growing up, the author focused on writing and programming. Specifically, before college, he wrote short stories as a beginning writer, which he found to be lacking in plot but rich in character emotions. He also spent time programming on an IBM 1401 computer in his school district's basement, using an early version of Fortran. He had to input programs on punch cards and run them on a noisy printer. This experience was limited and not very productive, as he couldn't find much to do with the machine without input data or the ability to perform more complex calculations. The author then shifted to using microcomputers, which allowed for more interactive programming and writing, and he began to write essays again. Later, in March 2015, he started working on Lisp again, attracted by its unique characteristics and power as a programming language.\n"
     ]
    }
   ],
   "source": [
    "response = query_engine.query(\"What did the author do growing up?\")\n",
    "print(response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
